docs: add operations documentation guides #309
WentingWu666666 wants to merge 18 commits into documentdb:main
Add six new operations guides covering day-2 cluster management:

- backup-and-restore: conceptual overview, on-demand/scheduled backups, restore workflow, retention policy, and troubleshooting
- scaling: vertical scaling (instancesPerNode 1-3) and PVC storage expansion with prerequisites and monitoring
- upgrades: operator, extension, and gateway upgrade procedures, rolling update behavior, and rollback protection
- failover: local automatic and cross-cluster manual failover, testing procedures, and application connection considerations
- restore-deleted-cluster: recovery from backup or retained PV, verification steps, and common pitfalls
- maintenance: monitoring, log management, resource tuning, node maintenance, rolling restarts, and routine checklists

Update mkdocs.yml with new Operations navigation section.

Refs documentdb#253

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Add YAML front matter (title, description, tags) to all 6 operations docs
- Rewrite Overview sections: what the operation is + why it matters
- Disambiguate all bare 'cluster' to 'DocumentDB cluster' or 'Kubernetes cluster'
- Disambiguate 'operator' to 'DocumentDB operator' in upgrades doc
- backup-and-restore: add CSI link, multi-region section, tabbed prerequisites, YAML block titles, restore constraints, cross-ref to networking for mongosh
- restore-deleted-cluster: route Method 1 to backup-and-restore, remove internal details section, add YAML title, cross-ref to networking for verify step
- scaling: replace unsupported storage expansion with link to storage config
- upgrades: remove unnecessary backup step from operator upgrade, replace heredoc with YAML block, use placeholder versions instead of fake 1.2.0
- maintenance: fix broken link to removed storage-expansion anchor
- Update configuration front matter descriptions to match actual content

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Replace "CNPG monitors/promotes/triggers" with "the operator monitors/promotes/triggers" in prose explanations across all operations docs (failover, maintenance, scaling, upgrades, backup-and-restore). Resource names like clusters.postgresql.cnpg.io are preserved in kubectl commands that users need to run. Also restructures several sections into Material for MkDocs tabbed format for improved readability and fixes the troubleshooting namespace reference from cnpg-system to documentdb-operator. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Refine all operations documentation based on review feedback:

- scaling.md: mirror structure for scale up/down tabs, fix "at least 2" for failover, remove unnecessary checklist
- failover.md: fix networking cross-reference anchor, remove false connection pooling/quorum claims, fix replica read claim
- upgrades.md: merge extension+gateway into single component upgrade (documentDBVersion upgrades both), move pre-upgrade checklist under component upgrades, simplify overview table, remove cluster health check from operator verify
- backup-and-restore.md: convert on-demand/scheduled to tabs with API refs, fix CSI prerequisite wording, add YAML title, update retention policy to table format, improve backup identification step
- maintenance.md: clarify logLevel scope (PostgreSQL only), remove fake resource allocation table, add PVC resize planned note, clarify cordon terminology
- restore-deleted-cluster.md: fix broken anchor references
- mkdocs.yml: reorder nav (failover before upgrades)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- failover: fix misleading write-only downtime claim to cover both reads and writes, add playground links for cross-cluster failover, explain instancesPerNode >= 2 requirement explicitly, merge behavior sections
- maintenance: add normal/investigate guidance for each maintenance task so users know what to expect and when to troubleshoot
- upgrades: add rollback sections with schema version check guidance (rollback if schema not upgraded, otherwise restore from backup)

All failover doc claims verified against source code and tested in Kind cluster (3-instance cluster, primary deletion triggers automatic failover with data preservation confirmed).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Scaling operations (instancesPerNode, pvcSize changes) do not propagate to existing CNPG clusters due to the reconciliation loop gap documented in issue documentdb#306. Moving scaling doc to a separate branch until the operator bug is fixed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Upgrade doc fixes verified against source code and Kind cluster:

- Fix downgrade behavior: operator skips schema migration but still updates images (not 'rejects the change')
- Fix rolling update: primaryUpdateMethod=restart means primary is restarted in place (no switchover)
- Fix health check: operator checks primary pod health, not all pods
- Fix CRD handling: Helm crds/ dir only applies on install, not upgrade
- Remove misleading 'zero-downtime' from description

Maintenance doc cleanup:

- Remove CNPG-internal Advanced Diagnostics section
- Remove troubleshooting section with CNPG-specific commands
- Remove broken link to scaling doc (moved to separate branch)
- Reorganize Routine Checks section placement

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Fix CRD URL from microsoft/ to documentdb/ GitHub org
- List all 3 CRDs (dbs, backups, scheduledbackups) instead of just 1
- Fix image override examples to use correct repo path:
  ghcr.io/documentdb/documentdb-kubernetes-operator/documentdb
  ghcr.io/documentdb/documentdb-kubernetes-operator/gateway

All 24 claims in the upgrade doc verified against source code and local Kind cluster.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Fix backup status from 'Succeeded' to 'completed' (actual phase value)
- Add missing metadata.name field to on-demand backup YAML example
- Apply same status fix in restore-deleted-cluster doc

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…enance doc The DocumentDB CRD has spec.resource.storage (for PVC config) but no spec.resources.limits for CPU/memory. Replace with generic guidance based on kubectl top output. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
The DocumentDB CRD (dbs.documentdb.io) exceeds the annotation size limit for client-side kubectl apply, causing 'metadata.resourceVersion: Invalid value: 0' errors. Switch to --server-side --force-conflicts which avoids this limitation. Verified in Kind cluster: CRD apply, helm upgrade (test->dev), and helm rollback all tested successfully with zero DocumentDB cluster disruption. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
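As a rough illustration of the switch this commit describes, the helper below wraps a server-side apply of a CRD manifest. The function name `apply_crds`, the `KUBECTL` override, and the manifest argument are illustrative assumptions and not part of the PR; the actual CRD manifest location is not stated here.

```shell
# Sketch: apply a CRD manifest with server-side apply, which sidesteps the
# client-side annotation size limit mentioned in the commit message.
# KUBECTL is a hypothetical override (for example KUBECTL=echo) that lets
# you preview the command without touching a cluster.
apply_crds() {
  local manifest="$1"   # path or URL of a CRD manifest (placeholder)
  "${KUBECTL:-kubectl}" apply --server-side --force-conflicts -f "$manifest"
}
```

Running `KUBECTL=echo apply_crds <manifest>` prints the command instead of executing it, which can be handy when reviewing an upgrade runbook.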
- Remove non-existent INSTANCES column from kubectl get documentdb table
- Fix pod label selector from documentdb.io/cluster to app=<cluster-name>
- Fix PG log path from postgresql.log to /controller/log/postgres
- Fix gateway container name from gateway to documentdb-gateway
- Replace non-existent BackupSucceeded event with real BackupSchedule event
- Replace non-existent FailoverCompleted event with real InvalidSchedule event
- Fix PVRetained event name to PVsRetained (plural, matches source code)

All fixes verified against Kind cluster and operator source code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
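To show how the corrected label selector and container name fit together, here is a minimal sketch that tails gateway logs. The selector `app=<cluster-name>` and the container name `documentdb-gateway` come from the commit above; the function name, the `KUBECTL` override, and the `--tail=100` value are assumptions made for illustration.

```shell
# Sketch: tail gateway sidecar logs for the pods of one DocumentDB cluster.
# KUBECTL is a hypothetical override (for example KUBECTL=echo) for previewing.
gateway_logs() {
  local namespace="$1" cluster="$2"
  "${KUBECTL:-kubectl}" logs -n "$namespace" -l "app=$cluster" \
    -c documentdb-gateway --tail=100
}
```

For example, `gateway_logs <namespace> <cluster-name>` would fetch the most recent lines from each matching pod.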
The scaling operations doc was moved to the wentingwu/scaling-docs branch pending resolution of issue documentdb#306. Remove the nav entry to avoid a broken link in the docs build. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Update the YAML description field in each operations doc so it accurately summarises the sections in that file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Pull request overview
This PR expands the public “Preview” documentation by adding a new Operations section (failover, upgrades, backup/restore, restore-deleted-cluster, maintenance) and updates several existing configuration page descriptions for clarity.
Changes:
- Adds new Operations documentation pages under `docs/operator-public-documentation/preview/operations/`.
- Updates `mkdocs.yml` navigation to surface the new Operations section.
- Refines YAML frontmatter `description` text for networking, TLS, and storage configuration docs.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| mkdocs.yml | Adds “Operations” nav entries to expose new operational guides (but currently also keeps an existing “Backup and Restore” entry at the same level). |
| docs/operator-public-documentation/preview/operations/upgrades.md | New guide describing operator vs component upgrades and rollback considerations. |
| docs/operator-public-documentation/preview/operations/failover.md | New failover guide covering local and multi-region/cross-cluster promotion. |
| docs/operator-public-documentation/preview/operations/backup-and-restore.md | New backup/restore guide using VolumeSnapshots and Backup/ScheduledBackup CRs. |
| docs/operator-public-documentation/preview/operations/restore-deleted-cluster.md | New recovery guide describing restore via Backup or retained PVs. |
| docs/operator-public-documentation/preview/operations/maintenance.md | New maintenance guide covering health checks, logs, resource monitoring, and events. |
| docs/operator-public-documentation/preview/configuration/tls.md | Updates page description to better reflect supported TLS modes and content. |
| docs/operator-public-documentation/preview/configuration/storage.md | Updates page description to remove unsupported “volume expansion” claim. |
| docs/operator-public-documentation/preview/configuration/networking.md | Updates page description to highlight mongosh connection and Service types. |
cf17f84 to 39f59a0
- Add missing metadata.name to ScheduledBackup example
- Fix GitHub org in failover cross-links (microsoft -> documentdb)
- Remove duplicate top-level 'Backup and Restore' nav entry from mkdocs.yml

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
The content is now covered by the Operations section: - operations/backup-and-restore.md (backup, restore, retention) - operations/restore-deleted-cluster.md (PV recovery) Update cross-references in faq.md and storage.md to point to new paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

    List backups for your DocumentDB cluster and choose one in `completed` status:

    ```bash
    kubectl get backups -n default

Nit: is it in the default namespace?
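One way to address this nit is to take the namespace as a parameter and filter for `completed` backups explicitly. The sketch below is illustrative only: the function names and the `KUBECTL` override are made up, and `.status.phase` is an assumed field path that should be confirmed against the Backup CRD.

```shell
# Print the names of backups whose phase is "completed".
# Reads "NAME PHASE" pairs on stdin, one per line.
completed_backups() {
  awk '$2 == "completed" { print $1 }'
}

# List completed backups in a given namespace instead of hardcoding "default".
# ASSUMPTION: the phase is exposed at .status.phase on the Backup resource.
list_completed_backups() {
  local namespace="$1"
  "${KUBECTL:-kubectl}" get backups -n "$namespace" --no-headers \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase \
    | completed_backups
}
```

Keeping the filter separate from the `kubectl` call makes it easy to check the filtering logic against captured output.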

    ### Step 4: Upgrade the DocumentDB Operator

    ```bash
    helm upgrade documentdb-operator documentdb/documentdb-operator \

Should we add `helm upgrade --skip-crds`, since we upgraded the CRDs manually above?

    ```

    ### Rollback and Recovery

I think for automatic rollback we can use `helm upgrade my-release my-chart --atomic`?
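A minimal sketch of the suggestion above, using the release and chart names from the reviewed upgrade step. The function name and the `HELM` override are illustrative assumptions. `--atomic` is the Helm 3 flag that rolls the release back if the upgrade fails; newer Helm releases may name this option differently, so check `helm upgrade --help` for your version.

```shell
# Sketch: upgrade the operator release and let Helm roll back on failure.
# HELM is a hypothetical override (for example HELM=echo) for previewing.
upgrade_operator() {
  local namespace="$1"
  "${HELM:-helm}" upgrade documentdb-operator documentdb/documentdb-operator \
    --namespace "$namespace" --atomic
}
```

Note that a Helm rollback only covers resources managed by the release; CRDs applied separately with `kubectl` are not reverted by it.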

    spec:
      gatewayImage: "ghcr.io/documentdb/documentdb-kubernetes-operator/gateway:<version>"
    ```

Should we talk about DocumentDB cluster update? Once the operator or schema updates are done, we want to migrate the cluster to newer versions.

    Backups protect your DocumentDB cluster against data loss from accidental deletion, corruption, or failed upgrades. A reliable backup strategy is the foundation of any production deployment; without it, recovery may be impossible.

    The DocumentDB operator provides a snapshot-based backup system built on Kubernetes [VolumeSnapshots](https://kubernetes.io/docs/concepts/storage/volume-snapshots/). Each backup captures a point-in-time copy of the primary instance's persistent volume, which can later be used to bootstrap a new DocumentDB cluster.

"Point-in-time" might not be the best wording, since it sounds like point-in-time restore. At a minimum, explain that data accumulated after a backup and before a crash might be lost.

    ## Prerequisites

    Before creating backups, ensure your Kubernetes cluster has the required snapshot infrastructure.

s/infrastructure/support/g

    ## Local Automatic Failover

    Local automatic failover requires at least two instances (`spec.instancesPerNode >= 2`). With a single instance, there is only the primary and no replica available to promote, so failover is not possible. When multiple instances are running, the operator automatically promotes a replica to primary if the current primary becomes unavailable.

We recommend matching the number of local replicas to the number of availability zones.
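To make this recommendation concrete, the sketch below counts the distinct availability zones on the Kubernetes nodes and clamps the result to the 1-3 range that this PR mentions for `instancesPerNode`. It relies on the standard `topology.kubernetes.io/zone` node label; the function names and the `KUBECTL` override are illustrative assumptions.

```shell
# Count distinct non-empty zone names read from stdin (one per line).
count_zones() {
  grep -v '^$' | sort -u | wc -l | tr -d ' '
}

# Read the zone label of every node and count the distinct zones.
# KUBECTL is a hypothetical override used only for previewing.
cluster_zone_count() {
  "${KUBECTL:-kubectl}" get nodes \
    -o jsonpath='{range .items[*]}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}' \
    | count_zones
}

# Clamp a zone count to the 1-3 range mentioned for instancesPerNode.
suggest_instances() {
  local zones="$1"
  if [ "$zones" -lt 1 ]; then echo 1
  elif [ "$zones" -gt 3 ]; then echo 3
  else echo "$zones"
  fi
}
```

A typical use would be `suggest_instances "$(cluster_zone_count)"`, treating the output as a starting point rather than a rule.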

    In a multi-region setup:

    - One DocumentDB cluster is designated as the **primary** and handles all writes.

The primary can be set up as an "HA cluster", with replicas providing local HA, so that a failover to another region is only necessary under extraordinary circumstances.

    ## Log Management

    === "DocumentDB Operator Logs"

We recommend setting up centralized log collection as part of your observability strategy (see the observability chapter).

    ```yaml
    spec:
      logLevel: "info" # Options: debug, info, warning, error

Why do we default to info? In prod, shouldn't it run at warn or error?
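If the production default is lowered as suggested, the change could be applied with a merge patch on `spec.logLevel`. The sketch below validates the level against the four options listed in the reviewed snippet before patching. The function name and the `KUBECTL` override are illustrative assumptions, and the `documentdb` resource name is taken from the `kubectl get documentdb` usage mentioned elsewhere in this PR.

```shell
# Sketch: set spec.logLevel on a DocumentDB resource via a merge patch.
# Accepts only the levels listed in the reviewed YAML comment.
set_log_level() {
  local namespace="$1" name="$2" level="$3"
  case "$level" in
    debug|info|warning|error) ;;
    *) echo "unsupported logLevel: $level" >&2; return 1 ;;
  esac
  "${KUBECTL:-kubectl}" patch documentdb "$name" -n "$namespace" --type merge \
    -p "{\"spec\":{\"logLevel\":\"$level\"}}"
}
```

For example, `set_log_level <namespace> <name> warning` would lower verbosity for a production cluster.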

    ```

    ### Step 2: Review Available Versions

Note: per release policy (see ...) we only support ...

    | Upgrade Type | What Changes | How to Trigger |
    |-------------|-------------|----------------|
    | **DocumentDB operator** | The Kubernetes operator itself | Helm chart upgrade |

Please also specify that we upgrade CNPG for you. Is there a way to skip that?

    ## Component Upgrades

    Updating `spec.documentDBVersion` upgrades **both** the DocumentDB extension and the gateway together, since they share the same version.

We should probably explain how we ensure that everything is deployed before we upgrade the schema on multi-region. The statement you wrote is confusing because it implies the schema gets updated automatically, which we don't want in multi-region.

    1. You update the `spec.documentDBVersion` field.
    2. The operator detects the version change and updates both the database image and the gateway sidecar image.
    3. The underlying cluster manager performs a **rolling restart**: replicas are restarted first one at a time, then the **primary is restarted in place**. Expect a brief period of downtime while the primary pod restarts.
    4. After the primary pod is healthy, the operator runs `ALTER EXTENSION documentdb UPDATE` to update the database schema.
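Because step 4 changes the database schema, whether a plain image rollback is still an option depends on whether that step has already run. The rollback guidance added earlier in this PR (roll back if the schema was not upgraded, otherwise restore from backup) can be sketched as a small decision helper. The function name is hypothetical, and comparing plain version strings is a simplifying assumption about how the two versions are reported.

```shell
# Sketch: choose a recovery path after a failed component upgrade.
# $1: extension version currently installed in the database
# $2: version that was running before the upgrade
rollback_action() {
  if [ "$1" = "$2" ]; then
    echo "rollback"              # schema untouched: revert spec.documentDBVersion
  else
    echo "restore-from-backup"   # schema already migrated: restore instead
  fi
}
```

The installed extension version could be read with a standard PostgreSQL catalog query such as `SELECT extversion FROM pg_extension WHERE extname = 'documentdb';`.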
backup-and-restore.md:

- Replace 'point-in-time copy' with 'crash-consistent snapshot' and clarify that PITR is not supported (data loss between snapshot and failure)
- s/infrastructure/support/ in prerequisites
- Use <namespace> placeholder instead of hardcoded 'default'

failover.md:

- Add tip: match instancesPerNode to number of availability zones
- Clarify that primary cluster can itself be multi-instance HA, reducing need for cross-region failover

maintenance.md:

- Add centralized log collection recommendation with link to telemetry playground
- Change logLevel example to 'warning' and add production tip

upgrades.md:

- Document that CloudNative-PG is bundled and upgraded automatically
- Add release strategy support window note
- Add --skip-crds to helm upgrade (CRDs are applied manually)
- Add --atomic tip for automatic rollback
- Add cross-link from operator upgrade to component upgrades
- Document multi-region upgrade order (standbys first, primary last)
- Document multi-region schema migration behavior (primary-only)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
helm upgrade does not touch CRDs at all (per Helm docs), so --skip-crds is a no-op and misleading. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
This PR adds 5 operations documentation guides for the DocumentDB Kubernetes Operator, covering day-to-day cluster management tasks.
New Documentation
Verification
Every command, event name, label, container name, and path in these docs was verified against:
Key decisions
- Scaling doc on a separate branch (`wentingwu/scaling-docs`): blocked on issue "Reconciliation loop does not propagate spec changes to existing CNPG clusters" #306 (reconciliation loop doesn't propagate spec changes to existing clusters)
- `--server-side --force-conflicts`: plain `kubectl apply` fails for the large `dbs.documentdb.io` CRD
- ... `helm upgrade`: ensures new CRD fields are available when the operator starts
- `spec.resources` references: DocumentDB CRD only has `spec.resource.storage`, not CPU/memory limits

Also includes

- `CONTRIBUTING.md` with MkDocs documentation testing instructions
- `mkdocs.yml` navigation (removed scaling.md)

Closes #253